Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-24868: [C++] Add a Tensor logical value type with varying dimensions, implemented using ExtensionType #37166

Merged
merged 18 commits into from
Oct 11, 2023

Conversation

rok
Copy link
Member

@rok rok commented Aug 15, 2023

Rationale for this change

For use cases where underlying datatype and number of dimensions in tensors are equal but not the actual shape we want to add a VariableShapeTensorType.
See #24868 and huggingface/datasets#5272

What changes are included in this PR?

This introduces definition of arrow.variable_shape_tensor extension and it's C++ implementation and a Python wrapper.

Are these changes tested?

Yes.

Are there any user-facing changes?

This introduces new extension type to the user.

@github-actions
Copy link

⚠️ GitHub issue #24868 has been automatically assigned in GitHub to PR creator.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just looked through the rst portion and have a few questions.

docs/source/format/CanonicalExtensions.rst Outdated Show resolved Hide resolved
docs/source/format/CanonicalExtensions.rst Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Aug 24, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 25, 2023
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Aug 25, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 25, 2023
@rok
Copy link
Member Author

rok commented Aug 25, 2023

Looking at PyTorch NestedTensor it stores strides per tensor as opposed to our proposal which stores just a permutation for the entire array which can then be used to derive strides (using tensor shape) per tensor. In PyTorch if strides are not provided to the constructor they are apparently calculated assuming row-major/C-contiguous layout.

This makes me assume strides are stored for caching (and not to allow variable data layout per tensor).

Should we adjust to store permutations (or strides) per row too or rather leave that for another extension (GeneralVariableShapeTensorType)? @lhoestq @mariosasko @AlenkaF @jorisvandenbossche @ezyang

@lhoestq
Copy link

lhoestq commented Aug 28, 2023

No strong opinion regarding storing strides. Cc @thomasw21 I'm not even sure that nested tensors are used that often no ? I'm also pinging other experts to get more insights.

@AlenkaF
Copy link
Member

AlenkaF commented Aug 28, 2023

How much extra complexity is added if we enable each tensor in an array to have it's own permutation defined?

@rok
Copy link
Member Author

rok commented Aug 28, 2023

@AlenkaF Storing permutations would allow per tensor memory layout, which I'm not sure is really needed in practice and could be confusing and complex.
Storing strides (and perhaps requiring uniform permutation) could be useful as consumers wouldn't need to calculate strides when using tensors. Complexity wouldn't really increase in this case.
Either way the size of type would increase by n tensors x ndim x sizeof(int64).

@rok rok force-pushed the ARROW-8714 branch 2 times, most recently from 0c43cd1 to 3390915 Compare August 29, 2023 08:36
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Oct 5, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 5, 2023
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Oct 5, 2023
@pitrou
Copy link
Member

pitrou commented Oct 5, 2023

@jorisvandenbossche Can I let you give this the final review?

@rok
Copy link
Member Author

rok commented Oct 6, 2023

@jorisvandenbossche can we merge this?

Comment on lines 219 to 221
- Example of minimal metadata is:

``{}``
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, one more small nitpick: the minimal metadata is actually no metadata, which is typically represented as an empty string (I am actually not fully sure if in this case the metadata key could also just not be present in the field metadata), instead of an empty json dict (I don't think we should necessarily recommend using an empty dict)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a fair point! Empty string feels like the safer choice here. See my suggested change below.

@rok rok requested a review from jorisvandenbossche October 9, 2023 15:54
@AlenkaF AlenkaF added this to the 14.0.0 milestone Oct 11, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting merge Awaiting merge and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Oct 11, 2023
@jorisvandenbossche jorisvandenbossche merged commit a7fab04 into apache:main Oct 11, 2023
@jorisvandenbossche jorisvandenbossche removed the awaiting merge Awaiting merge label Oct 11, 2023
@jorisvandenbossche
Copy link
Member

Thanks a lot @rok !

raulcd pushed a commit that referenced this pull request Oct 11, 2023
…ns, implemented using ExtensionType (#37166)

### Rationale for this change

For use cases where underlying datatype and number of dimensions in tensors are equal but not the actual shape we want to add a `VariableShapeTensorType`.
See #24868 and huggingface/datasets#5272

### What changes are included in this PR?

This introduces definition of `arrow.variable_shape_tensor` extension and it's C++ implementation and a Python wrapper.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

This introduces new extension type to the user.
* Closes: #24868

Lead-authored-by: Rok Mihevc <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit a7fab04.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
…mensions, implemented using ExtensionType (apache#37166)

### Rationale for this change

For use cases where underlying datatype and number of dimensions in tensors are equal but not the actual shape we want to add a `VariableShapeTensorType`.
See apache#24868 and huggingface/datasets#5272

### What changes are included in this PR?

This introduces definition of `arrow.variable_shape_tensor` extension and it's C++ implementation and a Python wrapper.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

This introduces new extension type to the user.
* Closes: apache#24868

Lead-authored-by: Rok Mihevc <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…mensions, implemented using ExtensionType (apache#37166)

### Rationale for this change

For use cases where underlying datatype and number of dimensions in tensors are equal but not the actual shape we want to add a `VariableShapeTensorType`.
See apache#24868 and huggingface/datasets#5272

### What changes are included in this PR?

This introduces definition of `arrow.variable_shape_tensor` extension and it's C++ implementation and a Python wrapper.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

This introduces new extension type to the user.
* Closes: apache#24868

Lead-authored-by: Rok Mihevc <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…mensions, implemented using ExtensionType (apache#37166)

### Rationale for this change

For use cases where underlying datatype and number of dimensions in tensors are equal but not the actual shape we want to add a `VariableShapeTensorType`.
See apache#24868 and huggingface/datasets#5272

### What changes are included in this PR?

This introduces definition of `arrow.variable_shape_tensor` extension and it's C++ implementation and a Python wrapper.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

This introduces new extension type to the user.
* Closes: apache#24868

Lead-authored-by: Rok Mihevc <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] Add a Tensor logical value type with varying dimensions, implemented using ExtensionType
9 participants